(54) Method and apparatus for generating query responses in a computer-based document
retrieval system



(57) The present invention relates to a method and 
apparatus for generating responses to queries to a doc- 
ument retrieval system. The system responds to a spe- 
cific request for information by locating and ranking por- 
tions of text that may contain the information sought. It 
locates small relevant passages of text (called "hit pas- 
sages") and ranks them according to an estimate of the 
degree to which they correspond to the information 
sought. The system minimizes the number of these hit 
passages that need to be examined before an information
seeker has either found the desired information or
can safely conclude that the information sought is not in 
the collection of texts. A relaxation ranking mechanism 
is provided to accommodate paraphrase variations that 
occur between the description of the information sought 
and the content of the text passages that may constitute 
suitable answers, by retrieving phrases that are dissim- 
ilar to the query phrase to different degrees according 
to a predefined set of rules, and penalizing the retrieved 
phrases based upon the degree of this dissimilarity, thus 
providing the user with a priority-organized query hit list.

Description

The present invention relates to a method and apparatus
for generating responses to queries to a document
retrieval system. When a large corpus (database)
of documents is searched for relevant terms (query
terms), it is desirable to find small relevant passages of
text (called "hits" or "hit passages") and rank them according
to an estimate of the degree to which they will
provide the information sought.

If the document database is very large, the number 
of hit passages generated may be far too high to be help- 
ful to the user. Mechanisms are needed to minimize the 
number of hit passages that a user must examine before 
he or she either has found the desired information or
can reasonably conclude that the information sought is
not in the collection of texts. 

This type of specific, "fine-grained" information access
is becoming increasingly important for on-line information
systems and is not well served by traditional
document retrieval techniques. The problem is exacer- 
bated with the use of small queries (of only a few words), 
which tend to generate larger numbers of retrieved doc- 
uments. 

When both the query and the size of the target (hit)
passage are small, one of the challenges in current sys- 
tems is that of dealing effectively with the paraphrase 
variations that occur between the description of the in- 
formation sought and the content of the text passages 
that may constitute suitable answers. Literal search engines
will not return paraphrases, and therefore may
miss important and relevant information. Search en- 
gines that allow paraphrases may generate too many 
responses, often without an adequate hierarchical ranking,
making the query response of minimal usefulness.

Thus, another challenge which is not currently well 
met is the effective ranking of the resulting hit passages. 
A high-quality ranking of matching document locations 
in response to queries is needed to enhance efficient 
information access.

Classical information retrieval (also called "docu- 
ment retrieval") measures a query against a collection 
of documents and returns a set of "retrieved" docu- 
ments. A useful variant (called "relevance ranking") 
ranks the retrieved documents in order of estimated relevance
to the query, usually by some function of the
number of occurrences of the query terms in the docu- 
ment and the number of occurrences of those same 
terms in the collection as a whole. 

Document retrieval techniques do not, however, attempt
to identify specific positions or passages within
the retrieved documents where the desired information 
is likely to be found. Thus, when a retrieved document 
is sufficiently large and the information sought is specific,
a substantial residual task remains for the information
seeker; it is still necessary to scan the retrieved docu- 
ment to see where the information sought might be 
found, if indeed the desired information is actually
present in the document. A mechanism is needed to address
this shortcoming.

In most previous information retrieval procedures 
for passage retrieval, a passage granularity is chosen 
at indexing time and these units are indexed and then 
either retrieved as if they were small documents, or collections
of individual sentences are retrieved and assembled
together to produce passages. See Salton et
al., "Approaches to Passage Retrieval in Full Text Information
Systems," Proceedings of the Sixteenth Annual
International ACM SIGIR Conference on Research and
Development in Information Retrieval (SIGIR 93) (incorporated
herein by reference), ACM Press, 1993, pp
49-58; Callan, J.R., "Passage-Level Evidence in Document
Retrieval," Proceedings of the Seventeenth Annual
International ACM-SIGIR Conference on Research
and Development in Information Retrieval (SIGIR 94)
(also incorporated herein by reference), Springer-Verlag,
1994, pp 302-310; and Wilkinson, R., "Effective Retrieval
of Structured Documents," (also in Proceedings
of the Seventeenth, etc., at pp 311-317). It would be useful
to have a system that dynamically sized passages
for retrieval based upon the degree to which the re- 
trieved passage matches the query phrase. 

Recently, a different approach has been proposed, 
based upon hidden Markov models and capable of dy- 
namically selecting a passage. See Mittendorf et al., 
"Document and Passage Retrieval Based on Hidden 
Markov Models," (Proceedings of the Seventeenth, etc.,
pp 318-327). However, this approach does not deal with 
the entire vocabulary of the text material, and requires 
reducing the document descriptions to clusters at index- 
ing time. It would be preferable to have a system that 
both encompasses the entire text base and does not re- 
quire such clustering. 

Aspects of the invention are set out in the accom- 
panying independent claims. Preferred and further fea- 
tures of the invention are set out in the dependent 
claims. Different combinations of the features of the de- 
pendent claims may be made, as appropriate, with the 
features of the independent claims. 

An embodiment of the invention provides for gen- 
erating responses to queries with more efficient and 
useful location of specific, relevant information passag- 
es within a text. The method locates compact regions 
("hit passages") within a text that match a query to some
measurable degree, such as by including terms that
match terms in the query to some extent ("(entailing)
term hits"), and ranks them by the measured degree of 
match. The ranking procedure, referred to herein as "re- 
laxation ranking", ranks hit passages based upon the 
extent to which the requirement of an exact match with 
the query must be relaxed in order to obtain a corre- 
spondence between the submitted query and the re- 
trieved hit passage. The relaxation mechanism takes in- 
to account various predefined "dimensions" (measures 
of closeness of matches), including: word order; word
adjacency; inflected or derived forms of the query terms;
and semantic or inferential distance of the located terms
from the query terms. 

A system according to the invention locates occurrences
of terms (words or phrases) in the texts (document
database) that are semantically similar to terms in
the query, so as to identify compact regions of the texts 
that contain all or most of the query terms, or terms sim- 
ilar to them. These compact regions are ranked by a 
combination of: their compactness; the semantic simi- 
larity of the located phrases to the query terms; the 
number of query terms actually found (i.e. matched with 
some located term from the texts); and the relative order 
of occurrence of the located terms compared with the 
order of the corresponding query terms.

The identified compact regions are called "hit passages,"
and their ranking is weighted to a substantial
extent based upon the physical distance separating the 
matching terms (compared with the distance between 
the corresponding terms in the query), as well as the 
"similarity" distance between the terms in the hit and the 
corresponding terms in the query. 

The foregoing criteria are weighted and the located 
passages are ranked based upon scores generated by 
combining all the weights according to a predetermined
procedure. "Windows" into the documents (variably
sized regions around the located "hit passages")
are presented to the user in an order according to the 
resulting ranking. 

A significant advantage of relaxation ranking is that 
the system automatically generates and ranks hits that 
in a traditional document retrieval system would have to
be found by a sequence of searches using different combinations
of retrieval operators. Thus, the number of
times the information seeker is unsatisfied by a result - 
and therefore needs to reformulate the query - is sig- 
nificantly reduced, and the amount of effort required to 
formulate the query is also significantly reduced. 

Another advantage is that the rankings produced by 
the current system are for the most part insensitive to 
the size or composition of the document collection and 
are meaningful across a group of collections, so that 
term hit lists produced by searching different collections 
can be merged, and the ranking scores from the differ- 
ent collections will be commensurate. This makes it pos- 
sible to parallelize and distribute the indexing and re- 
trieval process. 
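
As an illustration only, the merging of hit lists from separately searched collections might be sketched as follows. The sketch assumes that each list is already sorted by ascending net penalty and that a hit is a hypothetical (penalty, collection, position) tuple; neither the tuple layout nor the function name comes from the described embodiment.

```python
import heapq

def merge_hit_lists(*hit_lists):
    """Merge hit lists that are each sorted by ascending penalty.

    Each hit is a hypothetical (penalty, collection_id, position) tuple.
    Because the penalties are assumed to be commensurate across
    collections, a plain k-way merge yields one globally ranked list.
    """
    return list(heapq.merge(*hit_lists, key=lambda hit: hit[0]))

# Two collections searched independently (possibly on different machines).
manuals = [(0.0, "manuals", 120), (2.5, "manuals", 900)]
faqs = [(1.0, "faqs", 40), (3.0, "faqs", 7)]
merged = merge_hit_lists(manuals, faqs)
```

Since no score needs to be recomputed against global collection statistics, each collection can be indexed and searched in isolation before the merge.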

In addition, a system according to the invention is 
more successful than traditional systems at locating specific,
relevant passages within the retrieved documents,
and summarizes and displays these passages with in- 
formation generated by the relaxation ranking proce- 
dure, so that the user is informed why the passage was 
retrieved and can thus judge whether and how to exam- 
ine the hit passage. 

An embodiment of the invention is particularly ef- 
fective at handling short queries, such as from two to six 
words. Accordingly, an embodiment of a retrieval system
according to the invention may handle different queries
differently, using a conventional word search mechanism
for searches based upon one-word queries or queries
of more than six terms, and using the system of the
invention for searches based upon two- to six-word queries.
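
A minimal sketch of such query-length dispatch is given below. The function name, the returned labels, and whitespace tokenization are assumptions made for illustration; only the two-to-six-word range is taken from the text above.

```python
def choose_search_strategy(query, min_terms=2, max_terms=6):
    """Pick a search path from the number of whitespace-separated terms.

    The returned labels are hypothetical names for the two paths:
    relaxation ranking for short multi-word queries, and a conventional
    word search for everything else.
    """
    n_terms = len(query.split())
    if min_terms <= n_terms <= max_terms:
        return "relaxation_ranking"
    return "conventional_word_search"

strategy = choose_search_strategy("jump to end of file")
```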

Embodiments of the present invention are de- 
scribed hereinafter, by way of example only, with refer- 
ence to the accompanying drawings, in which: 

Figure 1 is a block diagram of a system according 
to an embodiment of the invention.

Figure 2 is a diagram of the interacting modules of 
an indexing and analysis system according to an em- 
bodiment of the invention. 

Figure 3 is an illustration of an exemplary search 
result as generated by the system according to an embodiment
of the invention.

Figure 4 is a flow chart of a generalized method for 
query processing according to an embodiment of the in- 
vention. 

Figures 5-5A are flow charts illustrating a more detailed,
preferred embodiment of the method of the invention.

An example of a system of the invention will first be
described in terms of its overall, general functionality,
including specific types of ranking and penalty criteria
that are used and configurations of hardware and software
suitable for implementing the invention. A specific
manner of implementing the relaxation ranking method
is presented, as well as examples of search results generated
by an actual implementation of the invention.

SECTION 1: The Apparatus of the Invention

Figure 1 shows a computer system 10 implementing
the invention. The system 10 may be a conventional
personal computer or workstation, including a processor
20, a memory 30 storing the operating system, applications
and data files, a keyboard and mouse 40, and a
display or other output device (such as a printer) 50. The
precise configuration is not crucial; for instance, the
memory 30 may be a distributed memory on a network,
a shared memory in a multiprocessor, and so on. Output
device 50 may alternatively and equivalently be a mass
storage device or any device capable of receiving the
output file resulting from a search query, whether in text,
graphical or other format, for storing, display or other
types of output. In the present application, "display" will 
be used generally to encompass any of these possibil- 
ities. 

Inputs to the system, such as search queries, are
made via the keyboard and mouse 40. In addition, 
search queries may be generated in the course of exe- 
cuting applications that are stored in the memory 30 and 
executed on the processor 20, or they may be received 
from remote hosts on a network or other communication 
channel. The source of the search queries is thus variable,
the present invention being directed to the execution
of the searches and handling of the results.

Memory 30 stores software including instructions
for carrying out the method of the invention, including a 
retrieval engine 60, which generally includes all program 
instructions or modules necessary to implement the in- 
vention. As will be appreciated in the following discus- 
sion, given the teaching of the present application it is 
a straightforward matter to generate programs or pro- 
gram modules to carry out the invention. 

Memory 30 also stores a document corpus 70, 
which includes all the documents in which a search is 
to be carried out, and a term occurrence index 80 com- 
prising an index of all, or some specified subset of, the 
terms within the document corpus, as described in fur- 
ther detail below. In addition, generator store 85 is a por- 
tion of memory 30 where the processor 20 temporarily 
stores information generated during the course of a que- 
ry response, before ultimately outputting the results to 
output buffer 90 (connected to the processor 20) for 
transfer to the display 50. 

The output buffer 90 is configured to store a user- 
defined or predetermined maximum number of hit pas- 
sages, as discussed in further detail below, or the total 
number of hits generated by a query response, if that 
total is not greater than the predetermined maximum. 
The hit passages, i.e. the regions of retrieved text that 
include term hits, are stored in a ranked order according 
to the method of the invention, described below. ("Term
hits" is used herein to refer to the individual terms that
are retrieved as somehow matching the query terms.)

A proximity buffer 95 is also connected to the proc- 
essor 20, and is used by the processor to store positions 
and sizes of "windows" onto a target document - i.e., 
regions in a document, of dynamically variable sizes, 
currently being searched by the processor for terms that 
match the input query terms. A window may be specified 
as a starting location within a target document plus a 
size that determines how much of the document, start- 
ing from that starting location, is to be included in a hit 
passage. A hit passage is that portion of the document 
covered by such a window, and includes hit terms, i.e. 
the matching terms themselves. 
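
By way of illustration, a window specified as a starting location plus a size might be represented as sketched below. The class name, the use of character offsets, and the sample document are assumptions for the purpose of the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Window:
    """A region of a target document: a starting location plus a size."""
    start: int  # starting location (character offset, assumed)
    size: int   # how much of the document, from start, is covered

    @property
    def end(self) -> int:
        return self.start + self.size

    def passage(self, document: str) -> str:
        """Return the portion of the document covered by this window."""
        return document[self.start:self.end]

document = "To move the cursor to the end of the input buffer, press End."
target = "move the cursor to the end of the input buffer"
window = Window(start=document.index(target), size=len(target))
hit_passage = window.passage(document)
```

Keeping the window as two integers makes it cheap to store many candidate windows alongside their hit terms, and the hit passage can be recovered from the document on demand.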

The hit terms and hit passages are also stored in 
the proximity buffer 95, correlated with the window in- 
formation. 

Figure 2 illustrates how the program modules
may be organized to carry out the indexing and analysis
operations that are applied to the document corpus 70
of text materials to be indexed in order to produce the
term occurrence index 80 and the term/concept relationship
network 110 used to support subsequent query
operations. 

The term indexing module 90 constructs the term 
occurrence index 80, which is a record of all the terms
that occur in the corpus 70 together with a record for 
each term listing the documents in which that term oc- 
curs and the positions within that document where the 
term occurs. This operation is a conventional operation 
in information retrieval. 



The terminology analysis module 100 analyzes 
each term in the corpus 70 to construct the term/concept 
relationship network 110, which is a corpus-specific se- 
mantic network of terms and concepts that occur in the 
corpus 70, or related terms and concepts that may occur
in a query, together with a variety of morphological, tax- 
onomic, and semantic entailment relationships among 
these terms and concepts that may be used subse- 
quently to connect terms in a query with terms in the text. 

The construction of the term/concept relationship 
network 110 draws upon and makes use of a lexicon 
180 composed of a general purpose lexicon 190 of in- 
formation about general English words and/or words of 
some other language and a domain-specific specialized 
lexicon 200 containing terms and information about 
terms that are specific to the subject domain of the cor- 
pus 70. These lexicons contain information about mor- 
phological relationships between words and other infor- 
mation such as the syntactic parts of speech of words 
that are used by morphological analysis routines within 
the terminology analysis module 100 to derive morpho- 
logical relationships between terms that may not occur 
explicitly in the lexicon. The operation and use of such 
lexicons and morphological analysis are conventional in
computational linguistics. 

The construction of the term/concept relationship 
network 110 also makes use of a taxonomy 120 composed
of a general purpose taxonomy 130 of taxonomic
subsumption relationships (i.e., relationships between 
more general and more specific terms) that hold be- 
tween general words and concepts of English and/or 
some other natural language and a domain-specific 
specialized taxonomy 140 of subsumption relationships 
that are specific to the subject domain of the corpus 70. 
This operation also makes use of a semantic network of 
semantic entailment relationships 150 composed of a 
general purpose entailments database 160 of semantic 
entailment relationships (i.e., relationships between a 
term or concept and other terms or concepts that entail 
or imply that term) that hold between general words and 
concepts of English and/or some other natural language,
and a domain-specific entailments database 170
of semantic entailment relationships that are specific to 
the subject domain of the corpus 70. The operation and 
use of such semantic taxonomies and semantic net- 
works are conventional in the art of knowledge repre- 
sentation. See John Sowa (ed.), Principles of Semantic 
Networks: Explorations in the Representation of Knowl- 
edge, San Mateo: Morgan Kaufmann, 1991 (incorporat- 
ed herein by reference). 

Each of these modules is utilized by the preferred 
embodiment of the invention, in a manner to be described
below, though different and equivalent configurations
may be arrived at to implement the invention.

SECTION 2: The Method of the Invention 

Figure 4 illustrates a generalized embodiment of the
method of the invention, and Figures 5-5A illustrate
more specifically the steps taken according to the pre- 
ferred embodiment of the invention. 

2A. Basic Method: Ranking and Penalty Procedures

Figure 4 corresponds to the twelve ranking and pen- 
alty procedures discussed below. At box 410, a search 
query phrase (consisting of one to many terms) is input, 
either entered by the user or requested by an executing 
process on the processor 20. Boxes 420-550 represent 
steps taken to penalize, rank and display the retrieved 
passages from the document corpus and are related to 
ranking procedures 1-12 listed below. The numerals in 
circles in Figure 4 indicate the correspondingly num- 
bered ranking criteria. 

In this more general discussion, the order of listing 
criteria/procedures 1-12 below and the order of boxes 
430-550 in Figure 4 do not indicate a required order of 
ranking or penalty assignments; rather, many different 
such orders are possible. 

The penalization and ranking criteria discussed be- 
low (especially those of procedures 1-7) are referred to 
herein as relaxation ranking criteria, since they allow for 
flexible ranking of retrieved passages of text. 

Procedure 1: Proximity ranking penalties. (Boxes 
420 and 470 of Figure 4.) Hit passages are identified as 
compact regions of text containing one or more matches 
for the query terms, and the hit passages are penalized 
depending upon how closely or far apart the matching 
terms occur together; i.e. the farther apart the located 
terms relative to their proximity in the query phrase, the 
higher the penalty. 
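
One possible reading of this criterion is sketched below, under the assumption that positions are word offsets and that the penalty grows linearly with the number of words by which the hit's span exceeds the length of the query phrase. The function name and the linear form are illustrative choices, not taken from the described embodiment.

```python
def proximity_penalty(hit_positions, query_length, weight=1.0):
    """Penalize a hit by how much wider it is than the query phrase.

    hit_positions: word offsets of the matched terms in the text.
    query_length:  number of terms in the query phrase.
    The linear form is an assumption made for illustration.
    """
    hit_span = max(hit_positions) - min(hit_positions) + 1
    excess = max(0, hit_span - query_length)
    return weight * excess

# Query "jump to end of file" (5 terms) matched against
# "move the cursor to the end of the input buffer":
# move=0, to=3, end=5, of=6, buffer=9.
spread_out = proximity_penalty([0, 3, 5, 6, 9], query_length=5)
compact = proximity_penalty([10, 11, 12, 13, 14], query_length=5)
```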

It should be noted that proximity penalization herein 
is not the same as the conventional information retrieval 
technique of using "proximity operators," in which a user
specifies a set of terms and a distance threshold within 
which occurrences of those terms must be found in or- 
der for a match to be counted. In the traditional tech- 
nique, the resulting hits are ranked by how many of the 
terms occur rather than by how closely the terms occur 
together, as in the present invention. 

Procedure 2: Permutation penalties. (Box 480 of 
Figure 4.) Hit passages are penalized by the degree to 
which their relevant phrases occur in a different order 
from the corresponding terms in the query phrase, using 
a measure of permutation distance between the order 
of the query terms and the order of their corresponding 
term hits. 
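
The text does not specify which permutation distance is used. As a sketch only, one common choice is the number of pairwise inversions (the Kendall tau distance), which is assumed here; the function name and the handling of missing terms are likewise illustrative.

```python
from itertools import combinations

def permutation_penalty(hit_positions, weight=1.0):
    """Count pairs of term hits that occur in the opposite order to the query.

    hit_positions[i] is the text position of the hit for query term i,
    or None if that query term has no hit in the passage. Inversion
    counting is an assumed measure of permutation distance.
    """
    found = [p for p in hit_positions if p is not None]
    inversions = sum(1 for a, b in combinations(found, 2) if a > b)
    return weight * inversions

in_order = permutation_penalty([3, 7, 9])
reversed_order = permutation_penalty([9, 7, 3])
with_missing = permutation_penalty([3, None, 1])
```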

Procedure 3: Morphological variation penalties.
(Box 430 of Figure 4.) Query terms are compared to 
terms in the target text that may be inflected or derived 
forms of the query terms, and are ranked by a small pen- 
alty factor so that exact matches are preferred over in- 
flectional or derivational variants, but only slightly so. 

Procedure 4: Taxonomic specialization penalties. 
(Box 440 of Figure 4.) Query terms are compared to 
terms in the text that are more specific according to a
taxonomy listing generality relationships among terms
and concepts, such as the taxonomy 120 in Figure 2.
Terms and concepts in the text that are more specific
than terms and concepts in the query are automatically
retrieved and may be ranked with a penalty for not being
exact matches to the query.

Procedure 5: Semantic entailment penalties. (Box 
450 of Figure 4.) Hit passages that contain terms with a 
high degree of "semantic similarity" to the query terms,
or that logically entail the query terms, are penalized
less than those with more remote semantic similarity or 
a lower strength of entailment. 

Procedure 6: Missing term penalties. (Box 460 of 
Figure 4.) Hit passages that contain matches for
some but not all of the query terms are included and penalized
according to the number of query terms that are missing
from the hit passage. In this way, when no complete
matches occur, the user is automatically presented with
information about the best matches that can be found.
The hit passages are also ranked according to a determination
of the importance of the missing terms.
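
A minimal sketch of a missing-term penalty follows. How term importance is determined is not stated above, so the sketch simply assumes a caller-supplied importance table with a default of 1.0 per term; all names are hypothetical.

```python
def missing_term_penalty(query_terms, matched_terms, importance=None, weight=1.0):
    """Return (penalty, missing) for query terms absent from a hit passage.

    importance maps a term to an assumed importance value; terms not
    listed count as 1.0. With no table, the penalty is simply the
    weighted number of missing terms.
    """
    importance = importance or {}
    missing = [t for t in query_terms if t not in matched_terms]
    penalty = weight * sum(importance.get(t, 1.0) for t in missing)
    return penalty, missing

query = ["jump", "to", "end", "of", "file"]
penalty, missing = missing_term_penalty(
    query, {"jump", "end", "file"}, importance={"to": 0.1, "of": 0.1}
)
```

Returning the list of missing terms alongside the penalty lets the same computation feed the match summary presented to the user.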

Procedure 7: Overlap suppression. (Box 500 of Fig- 
ure 4.) Hit passages that overlap (i.e. occupy at least a 
portion of the same "window" onto a target document
as) other hit passages with a better ranking are suppressed,
i.e. discarded. Hit passages with the same
ranking as another overlapping hit passage are likewise 
suppressed, since they add nothing to the overall rank- 
ing of the located document. 
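
Overlap suppression might be sketched as follows, assuming hits within one document are represented as hypothetical (penalty, start, end) tuples with an exclusive end offset. In this reading, a hit is discarded if it overlaps any hit that is ranked better, or that has an equal ranking and was encountered earlier.

```python
def suppress_overlaps(hits):
    """Drop hits that overlap a better-ranked (or equal, earlier) hit.

    hits: list of (penalty, start, end) tuples for one document, with
    end exclusive and lower penalty meaning better rank (assumed layout).
    """
    ordered = sorted(hits, key=lambda h: h[0])  # stable: ties keep input order
    kept = []
    for i, (penalty, start, end) in enumerate(ordered):
        overlaps_better = any(start < e and s < end for _, s, e in ordered[:i])
        if not overlaps_better:
            kept.append((penalty, start, end))
    return kept

hits = [(2.0, 10, 30), (1.0, 20, 40), (1.0, 35, 50), (3.0, 60, 70)]
surviving = suppress_overlaps(hits)
```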

Procedure 8: Positional ordering. (Box 510 of Figure
4.) All other factors being equal, hits with equal ranking
scores are ordered primarily in order of a default preferred
document order, and secondarily according to the
positions of given hit passages within the document in
which they occur.
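
This tie-breaking rule maps naturally onto a compound sort key, as in the sketch below. The dictionary layout of a hit and the document_rank table (document identifier to position in the preferred document order) are assumptions for illustration.

```python
def order_hits(hits, document_rank):
    """Sort by penalty, then preferred document order, then position.

    hits: dicts with hypothetical keys 'penalty', 'doc' and 'position'.
    document_rank: maps a document id to its place in the preferred order.
    """
    return sorted(
        hits,
        key=lambda h: (h["penalty"], document_rank[h["doc"]], h["position"]),
    )

document_rank = {"a": 0, "b": 1}
hits = [
    {"penalty": 1.0, "doc": "b", "position": 5},
    {"penalty": 1.0, "doc": "a", "position": 40},
    {"penalty": 1.0, "doc": "a", "position": 12},
    {"penalty": 0.5, "doc": "b", "position": 99},
]
ordered = order_hits(hits, document_rank)
```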

Procedure 9: Dynamic passage sizing and internal 
boundary penalties. (Box 520 of Figure 4.) Hit passages 
are identified by a passage of text consisting of the 
smallest sequence of sentences containing the hit region,
or if the hit region is within a portion of text that
does not have sentence structure (e.g., a table or a figure),
then the smallest coherent region containing the
hit region. The terms within the current query passage
that were specifically involved in determining the hit passage
are highlighted, if possible, when such identifications
are displayed. If a sentence ending (such as a period)
or paragraph boundary occurs within a given hit
passage, that passage is penalized.
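
A rough sketch of an internal boundary penalty is given below. It assumes that a paragraph boundary appears as a blank line, that a sentence ending is a period, exclamation mark or question mark followed by whitespace, and that the two penalties are applied independently and at most once each; the default penalty values are arbitrary.

```python
import re

def boundary_penalty(passage, sentence_penalty=1.0, paragraph_penalty=2.0):
    """Penalize a passage that spans a sentence or paragraph boundary.

    Boundary detection here is deliberately crude (assumed conventions):
    a blank line marks a paragraph boundary, and '.', '!' or '?' followed
    by whitespace marks a sentence ending inside the passage.
    """
    inner = passage.strip()
    penalty = 0.0
    if re.search(r"[.!?]\s", inner):
        penalty += sentence_penalty
    if re.search(r"\n\s*\n", inner):
        penalty += paragraph_penalty
    return penalty

within_sentence = boundary_penalty("move the cursor to the end of the file.")
across_sentences = boundary_penalty("Press End. The file closes")
across_paragraphs = boundary_penalty("Press End.\n\nThe file closes")
```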

Procedure 10: Match summaries. (Box 530 of Figure
4.) Hit passages are summarized by a list of the
terms in the hit passages that match the corresponding
terms in the query, with specific identification of query
terms that are not matched in each such hit passage.

Procedure 11: Ranking of lists. (Box 540 of Figure
4.) When the query is processed, the user is presented
with a ranked list of the term hits that have been discovered,
each of which has a ranking score that reports the
quality of the match (with lower overall penalty totals indicating
higher quality). Thus, each hit passage is identified
by a match summary and a display of the passage
of text that constitutes the hit. The term hits are listed in 
the order determined by combining the above ranking 
factors, and hit passages that are otherwise of equal 
rank are ordered according to their position in the corpus 
and text (i.e., hit passages in preferred documents are 
presented first and earlier hit passages within a docu- 
ment come before later hit passages). 

Procedure 12: Interactive passage access. (Box
550 of Figure 4.) Each of the term hits in the result list 
includes at least one active button or hyperlink that can 
be selected in order to view the corresponding hit pas- 
sage in its surrounding context in the document within 
which it occurs. Hit passages are highlighted when 
viewed in the context of their occurrence, and the terms 
in the hit passage that resulted in the match are marked. 
The user can then move around within the document at 
will, and can return to the highlighted hit passage at will. 

Once the procedure 400 has executed the steps 
420-550, it is ready to begin with another query, as in- 
dicated at box 560 of Figure 4, and otherwise to stop, 
as at box 570. 

2B. Basic Method: Ranking by Physical Proximity and 
Similarity 

The basic method according to an embodiment of 
the invention is to find regions of the indexed text in 
which all of the query terms occur close together, or 
where most of the query terms (or terms similar to most 
of the query terms) occur close together. These hit pas- 
sages are graded by the relaxation ranking criteria and 
presented to the user in order of this ranking. 

For example, if a user has submitted a query to lo- 
cate the phrase "jump to end of file" in a document cor- 
pus (such as an on-line user's manual for a text editor 
application), a hit passage returned by the retrieval engine
might be "move the cursor to the end of the input
buffer". In this case, the retrieved term "move" corresponds
to the query term "jump" as a term with close
semantic distance, and the intervening phrase "the cursor"
leads to a small penalty on the basis of a criterion
comparing the compactness of the retrieved passage 
vis-a-vis the original query phrase. Another retrieved 
passage that does not include intervening words would 
not receive this penalty. 

In this example, the phrase "the input buffer" corresponds
to the query term "file" by some measurable entailment
relation. As indicated above, entailment indicates
that a query term is implied to some extent by a retrieved
term; in this case, "input buffer" may be considered
to entail the virtual presence of the term "file". One
term entails another if the latter is implied by the former; 
in general, the entailing term will be narrower or more 
specific than the entailed term, but will sometimes be 
essentially synonymous. (Thus, "bird" entails "animal", 
and "plumage" entails "bird".) 



The hit passage "jump to end of file" would be as- 
signed a quantitative rank on the basis of the overall 
length of the hit, the number of missing terms (if any), 
and the strength of semantic similarity or entailment between
the aligned terms of the query and the corresponding
hit passage.

The method utilizes a term occurrence index 
(whose generation is discussed in Section 1 above) that 
can deliver the following information for each term of the 
query: 

1. an enumeration of the set of all documents in the
corpus that contain that term; 

2. for a given document, the positions (e.g., as byte 
offsets) within the document where the term occurs; 
and 

3. statistical information such as the number of oc- 
currences of the term in the collection, the number 
of documents in which it occurs, the number of 
times it occurs in each document, and the total 
number of documents and word tokens in the col- 
lection. 

The construction of such an index is a conventional op- 
eration in information retrieval. 
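
For concreteness, a toy version of such an index, able to deliver the three kinds of information listed above, might look like the following sketch. It assumes lower-cased word tokens and records character offsets (which coincide with byte offsets for plain ASCII text); the function names and dictionary layout are hypothetical.

```python
import re
from collections import defaultdict

def build_term_index(corpus):
    """Build term -> {doc_id: [offsets]} plus collection totals.

    corpus: dict mapping a document id to its text (assumed input form).
    """
    postings = defaultdict(dict)
    total_tokens = 0
    for doc_id, text in corpus.items():
        for match in re.finditer(r"\w+", text.lower()):
            postings[match.group()].setdefault(doc_id, []).append(match.start())
            total_tokens += 1
    return postings, {"documents": len(corpus), "tokens": total_tokens}

def term_statistics(postings, term):
    """Summarize where and how often a term occurs."""
    docs = postings.get(term, {})
    return {
        "documents": sorted(docs),                           # item 1
        "positions": docs,                                   # item 2
        "collection_count": sum(len(p) for p in docs.values()),  # item 3
        "document_count": len(docs),
        "per_document": {d: len(p) for d, p in docs.items()},
    }

corpus = {"a": "Move the cursor. Move again.", "b": "The end of the file."}
postings, totals = build_term_index(corpus)
the_stats = term_statistics(postings, "the")
```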

The method may further use facilities (also dis- 
cussed in Section 1 above) for obtaining stems or mor- 
phological variants of terms, semantically related terms, 
more specific terms, and terms that entail a term. Each 
of these related terms may have an associated numer- 
ical "similarity distance" between a query term and the 
retrieved term. This similarity distance is used as an as- 
sociated penalty to be assigned when matching a query 
term against the retrieved term. 

For example, for a query term "change", morpho- 
logical variants would include "changed", "changing" 
and "interchange"; a semantically related term might be 
"influence"; more specific terms would include "alter" 
and "damage"; and an entailing term might be "move"
(since moving something entails a change of position).
In the description below, these related terms will be generally
referred to as "similar terms" or "entailing terms",
and numeric penalties are associated with each similar 
or entailing term based on the kind of association be- 
tween the query term and the entailing term, together 
with the similarity distance between the two terms. 
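
The association of penalties with similar terms might be sketched as below, using the "change" example. The relation table, the per-kind base penalties, and the distances are invented values chosen only to illustrate combining the kind of association with a similarity distance; they are not taken from the described embodiment.

```python
# Hypothetical base penalty for each kind of association.
KIND_PENALTY = {
    "morphological": 0.1,
    "specialization": 0.3,
    "semantic": 0.5,
    "entailment": 0.5,
}

# Hypothetical relation table: query term -> (related term, kind, distance).
RELATED = {
    "change": [
        ("changed", "morphological", 0.0),
        ("changing", "morphological", 0.0),
        ("interchange", "morphological", 0.2),
        ("influence", "semantic", 0.2),
        ("alter", "specialization", 0.0),
        ("damage", "specialization", 0.2),
        ("move", "entailment", 0.1),
    ],
}

def similar_terms(query_term):
    """Map each similar or entailing term to its assumed match penalty."""
    penalties = {query_term: 0.0}  # an exact match carries no penalty
    for term, kind, distance in RELATED.get(query_term, []):
        penalties[term] = KIND_PENALTY[kind] + distance
    return penalties

change_terms = similar_terms("change")
```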

A "generator" is constructed for each term in the 
query. The generator is a data structure or database 
stored in memory that enumerates positions in docu- 
ments at which the query term or any of its similar terms 
occur. It is these occurrences of the query term or its 
similar terms that are referred to as the "(entailing) term 
hits" for that term. 

The documents in the collection are assigned an ar- 
bitrary order, such as the order in which they were in- 
dexed or preferably an ordering in which more popular, 
informative, or useful documents precede documents 
that are less likely to be useful. The generator for each
query term is initialized to generate the first occurrence
of a term hit for that query term in the first document in 
the collection in which a term hit for that term occurs. 
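
As a sketch under stated assumptions, a generator for one query term could be realized as a Python generator that merges the position lists of the query term and its similar terms, yielding term hits in document order. The postings layout (term to document to positions), the similar-term penalty table, and the document_order table are hypothetical inputs.

```python
import heapq

def term_hit_generator(similar, postings, document_order):
    """Yield (doc_rank, position, term, penalty) tuples in document order.

    similar:        term -> penalty, for the query term and its similar terms.
    postings:       term -> {doc_id: [positions]} (assumed index layout).
    document_order: doc_id -> rank in the chosen ordering of the collection.
    """
    streams = []
    for term, penalty in similar.items():
        hits = [
            (document_order[doc], pos, term, penalty)
            for doc, positions in postings.get(term, {}).items()
            for pos in positions
        ]
        streams.append(sorted(hits))
    # Each stream is sorted, so a k-way merge yields a sorted sequence.
    yield from heapq.merge(*streams)

postings = {"move": {"b": [4], "a": [10]}, "jump": {"a": [2]}}
generator = term_hit_generator(
    {"jump": 0.0, "move": 0.5}, postings, {"a": 0, "b": 1}
)
first_hit = next(generator)
remaining_hits = list(generator)
```

The first value produced corresponds to the first occurrence of a term hit in the first document that contains one, matching the initialization described above.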

Intuitively, the method proceeds by moving a win- 
dow through each document containing any of the term 
hits for any of the terms of the query, determining wheth- 
er that window contains a match for the query as a 
whole, choosing whether to extract a hit passage from 
that window, and if so then ranking the selected pas- 
sage. 

The size of the query window is determined by a 
(temporarily) fixed location parameter plus a window 
size parameter, determined as the product of a prede- 
termined factor multiplied by the length of the query. 
These two parameters can be manipulated by the infor- 
mation seeker or an executing process, or may be set 
to predetermined useful values. 

A window 300 onto a document 305 is shown in Fig- 
ure 3, and includes lines of text 310.1-310.11 including 
a hit passage 320 containing terms 320.1-320.n (t1,
t2, ..., tn). The hit passage 320 has a beginning marked
by a start position 330 and an end marked by an end 
position 340. 

The window 300 can move over the body of the doc- 
ument 305 to include different portions thereof. For in- 
stance, as it moves down relative to the text illustrated, 
it will omit line 310.1 and include line 310.12 (which 
would be the next line below 310.11), then omit line 
310.2 and include line 310.13, and so on. The use of the
window construct is presented in detail below. 

Other parameters (either predetermined or set by 
the user or a process) determine the weighting of each 
of the different dimensions of relaxation (e.g., proximity, 
permutation, morphology, taxonomy, entailment, and 
deletion), and two parameters specify penalties to be 
assigned if a hit passage contains a sentence boundary 
or a paragraph boundary. Each of these parameters can 
either be made available for manipulation by the infor- 
mation seeker or set to predetermined useful values. 
The ranking of a passage is determined by its net penalty,
which is the sum of the penalties assigned to it from these
various sources.

2C. General Method for Generating Hit Passages in
Order of Desired Ranking

The following methodology gives a generalized procedure
for generating hit passages and for ordering
them in a ranking that best reflects the search query. 
Further below is a discussion of a specific implementa- 
tion of this methodology. 

Let the query q be a sequence of terms q1, q2, ...,
qm, each of which is a word or phrase, and let x be a
text document including a sequence of words x1, x2, ...,
xn. A term-similarity distance function is used that
assigns to ordered pairs of terms (p, p') a distance measure
d = d(p, p'), where p and p' are terms and d is the
similarity distance between the terms.



A similarity distance of zero will represent identity
or full synonymy of the terms, or some other circumstance
in which no penalty is assigned to matching query
term p to text term p'. Larger similarity distances will
correspond to terms that are only partially synonymous
or otherwise related, e.g., because one is more general
than another or entailed by the other, or because some
sense of one is partially synonymous to some sense of
the other, or because the terms are semantically similar
in some other way.

Given a query q, we want to find an alignment a =
(q1, xi1), (q2, xi2), ..., (qm, xim) of terms in the query with
terms in the text such that

(1) each pair consisting of a term from the query and
a term from the text has a small similarity distance;

(2) the terms in the text that are aligned with terms
in the query occur near each other in the text; and

(3) we rank such an alignment more highly if the
term hits in the text occur in the order that their
corresponding query terms occur in the query.

Alignments are also considered that have text
correspondences for only some subset of the query terms,
and they are ranked worse (penalized more) than alignments
that contain more of the query terms, by giving
them penalties determined by the kind of term that is
missing and/or the role that it plays in the query.

A similarity distance metric is organized so that,
given a query term qi (either a single word or a phrase
including a sequence of words), a function call is made
that returns a list of term-distance pairs (t1, d1), (t2,
d2), ..., (tj, dj) in increasing order of the distance value
dj, where dj is the similarity distance between the query
term qi and the potential text term tj. Let us call this
function "similar-terms".
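The behavior of "similar-terms" can be illustrated with a minimal sketch. The table RELATED, its distance values, and the Python function name similar_terms are assumptions made for illustration only; the text does not specify how the distances are stored or what their values are.

```python
from typing import Dict, List, Tuple

# Hypothetical hand-built table: query term -> {related term: distance}.
# The distances are illustrative placeholders, not values from the text.
RELATED: Dict[str, Dict[str, float]] = {
    "change": {
        "changed": 0.08, "changing": 0.08,   # morphological variants
        "influence": 0.3,                    # semantically related
        "alter": 0.1, "damage": 0.1,         # more specific terms
        "move": 0.1,                         # entailing term
    },
}


def similar_terms(query_term: str) -> List[Tuple[str, float]]:
    """Return (term, distance) pairs in increasing order of distance."""
    pairs = [(query_term, 0.0)]  # the term itself matches at zero distance
    pairs.extend(RELATED.get(query_term, {}).items())
    # Sort by distance, then alphabetically so that ties are deterministic.
    return sorted(pairs, key=lambda pair: (pair[1], pair[0]))
```

Returning the pairs already sorted means a caller can stop reading the list as soon as the distances exceed whatever penalty it is still willing to accept.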

The text sequence x1, x2, ..., xn is indexed in advance,
so that a function call "term-index" for a given
term tj locates: (1) all of the documents in which that
term occurs; and, for each document, (2) all of the positions
i at which a match for the term tj occurs in the
text. If tj is a sequence of words w1, w2, ..., wp, then a match
exists for tj at position i if xi = w1, xi+1 = w2, ..., and
xi+p-1 = wp.
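The matching condition for multi-word terms can be sketched as follows. This is only an illustration under stated assumptions: the names build_index and term_index, the in-memory dictionary layout, and the use of 0-based word positions are all hypothetical choices, not the indexing scheme of the described system.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def build_index(docs: Dict[str, Sequence[str]]) -> Dict[str, List[Tuple[str, int]]]:
    """Map each word to the (document id, word position) pairs where it occurs."""
    index: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for doc_id, words in docs.items():
        for i, word in enumerate(words):
            index[word].append((doc_id, i))
    return index


def term_index(index: Dict[str, List[Tuple[str, int]]],
               docs: Dict[str, Sequence[str]],
               term: Sequence[str]) -> List[Tuple[str, int]]:
    """Positions i where x[i] = w1, x[i+1] = w2, ..., x[i+p-1] = wp."""
    hits = []
    # Only positions of the first word can start a match for the whole term.
    for doc_id, i in index.get(term[0], []):
        words = docs[doc_id]
        if list(words[i:i + len(term)]) == list(term):
            hits.append((doc_id, i))
    return hits
```

Looking up only the first word and then verifying the remaining words in place keeps the index small; a production index would more likely intersect the position lists of all the words.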

For each term qi in the query q, a sequence of term
hits (exact matches or entailing "close hits") is constructed
for the term qi by combining the term-index entries
for that term and for all of its similar (entailing) terms.
Each of these term hits will have a weight or penalty
corresponding to the similarity distance between the query
term and the matching text term (or zero for exact
matches of the term).
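One way to combine the term-index entries of a query term and its similar terms into a single ordered stream of term hits is a sorted merge. The sketch below assumes that each posting list is already sorted by (document rank, position); the TermHit record and the function name term_hits are hypothetical.

```python
import heapq
from typing import Dict, Iterator, List, NamedTuple, Tuple


class TermHit(NamedTuple):
    doc: int         # rank of the document in the collection order
    pos: int         # position of the hit within the document
    text_term: str   # the term actually found in the text
    penalty: float   # similarity distance to the query term (0 if exact)


def term_hits(similar: List[Tuple[str, float]],
              postings: Dict[str, List[Tuple[int, int]]]) -> Iterator[TermHit]:
    """Lazily yield term hits for one query term in (doc, pos) order."""
    streams = [
        [TermHit(doc, pos, term, dist) for doc, pos in postings.get(term, [])]
        for term, dist in similar
    ]
    # Each stream is sorted by (doc, pos), so the merge preserves that order.
    return heapq.merge(*streams)
```

Because heapq.merge is lazy, hits are produced one at a time, which matches the role of a generator that is stepped only as far as the current window requires.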

Generally, the method for generating and returning 
hit passages for a given query q is as follows: 

1. Set up a generator of term hits for each significant
term in the query (certain function words such as
"of" and "the" may be judged insignificant and
ignored). These generators will generate term hits in
documents in which a term hit occurs, in the order
of the documents in the collection and, within a
document, in the order of the position of the term hit
within the document.

2. Overall hit passages for the query q are generat- 
ed sequentially by starting at the position of the first 
similar term (t) generated by any of the terms of the 
query. This term hit may be referred to as the "root".
Thus the root for the first hit passage is the earliest 
word in the earliest document in the collection that 
is a term hit for one of the terms in the query. Then 
the method inspects all of the term hits generated 
by any of the other terms in the query that are in the 
same document and within a window determined by 
a threshold proximity distance (the proximity hori- 
zon) from the position of the root term t. For each 
combination of term hits from the other (non-root) 
generators that occur within this window, a net pen- 
alty score for this combination is computed from the 
distances between the individual term hits, the sim- 
ilarity distances or match penalties involved in each 
of the term hits, syntactic information about the re- 
gion of the hit passage (such as whether there is a 
sentence or paragraph boundary contained in the 
hit passage) and an appropriate penalty for any 
term in the query that has no corresponding hit with- 
in the window (this penalty depending on the kind 
of word that is missing and/or its role in the query 
or frequency in the collection). These hit passages 
are also assigned a penalty for crossing a sentence 
boundary or crossing a paragraph boundary, de- 
pending on the parameter settings for sentence 
boundary penalty and paragraph boundary penalty. 
The best such combination is selected and gener- 
ated as a hit passage for the query. 

3. After generating a hit passage, the generator for 
the root term (t) is stepped to the next term hit for 
that term and the generators for all of the other 
terms in the query are restored to the values they 
had when the previous root term t was first selected. 
A new root is now selected (the earliest term hit of 
any of the currently generated term hits) and the 
process is repeated. 

4. This process of generating hit passages for the 
query is repeated either until a sufficient number of 
zero penalty hit passages has been generated (de- 
termined by a specified limit), or until there are no 
more term hits to generate, after which all of the hit 
passages that have been found are sorted by their 
net overall penalty. Hit passages that are contained 
within or overlap better hit passages or earlier hit 
passages with the same score are suppressed, and 
the best remaining hit passages (up to the specified 
limit) are presented to the information seeker in or- 
der of their overall penalty score (smallest penalty 
first). Alternatively, hit passages can be provided to 
a display window as they are generated and each
new hit is inserted into the display at the appropriate
rank position as it is encountered. To avoid replacing
a displayed hit passage that overlaps with a later
better hit passage, sending hit passages to the display
should be delayed until the search window has
moved beyond the point of overlap.

5. Each hit passage in the presented query hits list
is displayed with its penalty score, a summary of the
match criteria (including a list of the corresponding
term hits for each query term), an identification of
the position of the passage within its source document
(such as a document id and the byte offsets
of the beginning and end of the passage), and the
text string of the retrieved passage. The retrieved
passage starts at the latest sentence or segment
boundary in the source document that precedes the
earliest term hit in this match and ends at the first
sentence or segment boundary that follows the latest
term hit.

6. The displayed term hit list can be used to access
a display of the retrieved passages in the context in
which they occur. This is done by opening a viewing
window on the document in which the passage occurs,
positioning the text within the viewing window
so that the retrieved passage is visible within it,
highlighting the passage within the window, and if
possible marking the term hits that justified the passage
so that they are visible to the user.
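The passage extraction of step 5 (expanding from the earliest and latest term hits to the enclosing sentence boundaries) can be sketched as below. Treating a period followed by a space as the only sentence boundary is a simplifying assumption for illustration, and the function name retrieved_passage is hypothetical; the text does not say how boundaries are detected.

```python
from typing import Tuple

# Assumed sentence delimiter; real text would need a proper segmenter.
BOUNDARY = ". "


def retrieved_passage(text: str, first_hit_start: int,
                      last_hit_end: int) -> Tuple[int, int]:
    """Expand the span covered by the term hits to sentence boundaries."""
    # Latest boundary that lies entirely before the earliest term hit.
    before = text.rfind(BOUNDARY, 0, first_hit_start)
    start = 0 if before == -1 else before + len(BOUNDARY)
    # First boundary at or after the end of the latest term hit.
    after = text.find(BOUNDARY, last_hit_end)
    end = len(text) if after == -1 else after + 1  # keep the period itself
    return start, end
```

The returned offsets play the role of the beginning and end positions reported with each hit, so the same pair can drive both the hit list entry and the highlighting in the viewing window.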

Unlike conventional document retrieval, the system
of the present invention locates specific passages of
information within the document, not simply the document
itself. This is similar to what has been called "passage
retrieval" in information retrieval literature, but in the
present invention the passages are constructed dynamically
in response to the query using a general-purpose
full-text index of terms and positions, and the size and
granularity of the passage is variable depending on what
is found in the match.

2D. Examples of Queries and Results. 

The following example is a portion of a summarized
term hit list produced by an actual implementation of this
method used by applicant, indexing the tutorial
documentation for the well-known Emacs text editor. In the
listing, each hit entry comprises a data structure including
a sequence number, a penalty score, a list of matching
terms, the document in which the hit occurred, and
the positions of the hit within the document in the following
format:

<hit sequence number>
(hit <penalty score> <list of matching terms>
<file where hit was found> <beginning position>
<end position>)
<retrieved text passage>

Here are results generated for the query phrase
"move to end of file", i.e. a search in a predefined
document corpus for this phrase. (The document corpus in
this example, as noted above, is a portion of the Emacs
text editor documentation.)

The first three entries of the resulting hit list were: 

1
(hit 0.115 ("GO" "TO" "END" "FILE") "/home/emacs-tutorial" 5881 5898)
M-> Go to end of file

2
(hit 0.155 ("MOVES" "TO" "END" "FILE") "/home/emacs-tutorial" 4984 5012)
which moves to the end of the file.

3
(hit 2.849 ("DASHES" (MISSING TO) "ENDS" "FILE")
"/home/emacs-tutorial" 15624 15753)
begins and ends with dashes, and contains the
string "Emacs: TUTORIAL". Your copy of the
Emacs tutorial is called "TUTORIAL". Whatever file
you find, that file's name will appear in that precise
spot.

(The italicized portions above are the actual retrieved 
hit passages located as matches for the input query 
phrase "move to end of file".) 

The following excerpted portions of the associated 
text for the above results illustrate the display of the re- 
spective hit passages in context, in which the hit region 
(passage) is underlined, and the located term hits
appear in bold:

No. 1. For hit 0.115 ("GO" "TO" "END" "FILE"):

M-a Move back to beginning of sentence
M-e Move forward to end of sentence
M-< Go to beginning of file
M-> Go to end of file

>> Try all of these commands now a few times for
practice. Since the last two will take you away from
this screen, you can come back here with M-v's and
C-v's. These are the most often used commands.

No. 2. For hit 0.155 ("MOVES" "TO" "END" "FILE"):

Two other simple cursor motion commands are:
M-< (Meta Less-than), which moves to the beginning
of the file, and M-> (Meta Greater-than), which
moves to the end of the file. You probably don't need
to try them, since finding this spot again will be boring.
On most terminals the "<" is above the comma
and you must use the shift key to type it. On these
terminals you must use the shift key to type M-< also;
without the shift key, you would be typing M-comma.

No. 3. For hit 2.849 ("DASHES" (MISSING TO)
"ENDS" "FILE"):

If you look near the bottom of the screen you
will see a line that begins and ends with dashes,
and contains the string "Emacs: TUTORIAL". Your
copy of the Emacs tutorial is called "TUTORIAL".
Whatever file you find, that file's name will appear
in that precise spot.

There is a gradual relaxation from good matches to
successively less likely matches, with appropriate penalty
scores to indicate how poor each match is.
In this example, penalty scores greater than 2
indicate substantial likelihood that the match is not useful.
Note that the system is not sensitive to how context
determines senses of words, so it accepts "dashes" as
a specialization of "move" even though in this context it
is clearly a plural noun rather than a verb. In contrast,
in the first hit, "move" is correctly matched to the more
specific term "go," while in the second, it correctly
matches the inflected form "moves."
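As a rough consistency check on how the penalties combine, the score of 0.115 for the first hit can be reproduced under the following assumptions, none of which the text states explicitly for this hit: "of" is dropped as an insignificant function word, "go" is charged the *descendants-penalty* of 0.1 given in Section 2G as a specialization of "move", and each gap between successive term hits is charged the *gap-penalty-factor* of 0.005 times one less than its length in characters. The Python names are hypothetical.

```python
DESCENDANTS_PENALTY = 0.1    # taxonomic specialization ("move" -> "go")
GAP_PENALTY_FACTOR = 0.005   # charged per character of gap beyond the first


def gap_penalty(chars_between: int) -> float:
    """Penalty for a gap; a single character (one space) costs nothing."""
    return GAP_PENALTY_FACTOR * max(chars_between - 1, 0)


# Passage "Go to end of file": characters between the successive term
# hits GO/TO, TO/END and END/FILE (the last gap spans the ignored "of").
gaps = [len(" "), len(" "), len(" of ")]
penalty = DESCENDANTS_PENALTY + sum(gap_penalty(g) for g in gaps)
print(round(penalty, 3))
```

This is only a check for this one hit; the other scores in the listing may include contributions that this reading does not model.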

The method of the invention thus finds passages
within texts that contain answers to a specific information
request, and ranks them by the degree to which they
are estimated to contain the information sought.

2E. Specific Method for Generating Hit Passages in
Order of Desired Ranking

Figure 5 is a top-level flow chart of the method of
the invention. A search query is input at box 510, and at
box 520 the method identifies target regions in the corpus
that contain matches for the query (search) terms.
This is carried out using the outputs of the term indexing
modules 90 and 100 shown in Figure 2, according to the
procedure detailed in Section 2F below.

At box 530, the processor 20 fills the output buffer
with the sorted list of query hits, in a procedure detailed
in Figure 5A and Section 2F below. The ranked list of
hits is then displayed on display 50, and/or may be
stored as a file in mass storage for future use.

At box 550, the actual hits are displayed and/or
stored according to their assigned ranks. Hit terms are
highlighted, and hyperlinks are provided to targeted text, i.e.
the documents in which the hit passages were located.

This completes the processing of a given query. If
there is another query, the method proceeds from box
560 to box 510, and otherwise ends at box 570.

2F. Method for Identifying Target Regions and Sorting
Query Hits

This section discusses the method of the invention
for carrying out step 520 of Figure 5. When the query
is made, documents are located by using the results of
the index modules 90 and 100, as mentioned above,
thus providing to the processor a series of documents
within which matches for the query terms should be found.
Within each such document in which query term matches
are found to occur, the following steps 0-6 are executed
by the processor. Their operation becomes clearer
in the subsequent discussion of Figure 5A.

0. The proximity buffer is initially seeded with the 
first entailing term hit generated by the entailing 
term generator for this document and an operating 
parameter penalty-threshold is set to *maximum- 
penalty-threshold*, the maximum penalty that will 
be accepted for a query hit. (In the preferred em- 
bodiment, this parameter is set to 50. This param- 
eter can obviously be varied and can be made sub- 
ject to control by the user.) 

As mentioned above, the proximity buffer cor- 
responds to the "window" that the method effective- 
ly moves through a given document, defining re- 
gions of the document where term hits are to be 
found. The proximity buffer stores everything in a 
given window, as well as information identifying the 
size of the window and its position in the document. 
The "size" of the window may be defined by the be- 
ginning position of the window in the document plus 
the proximity horizon, i.e. the end of the window in 
the document, which is a variable position as dis- 
cussed below. 

1. The proximity horizon is set based on the position
of the first hit in the proximity buffer by adding the
proximity window size determined for this query.
The proximity buffer is then filled with all qualified
entailing term hits, i.e. all of the entailing term hit
occurrences that occur within the proximity horizon,
by stepping the entailing term hit generator until
the next hit would be beyond the proximity horizon
or until there are no more entailing term hits. If an
entailing term hit is generated that is beyond the
proximity horizon, it is left in the generator store to
be generated later. These entailing term hits are
generated by the method described below in Section
2H.

In the preferred embodiment, the proximity horizon
is set to pick up entailing hits within a number
of characters equal to: (a) the number of terms in
the query times the parameter *proportional-proximity*
(e.g. 100), if this parameter is set (by the user
or an application); or to (b) a *proximity-threshold*
(e.g. 300) number of characters from the position of
the first hit in the buffer, if the proportional-proximity
parameter is not set. These parameters can be varied
or made to depend on the query in other ways,
and can be made subject to control by either the
user or an executing application or process, or both.
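The choice between the two window-size rules can be written out as a short sketch. The function name proximity_horizon and the use of None to mean "parameter not set" are assumptions for illustration; the example values 100 and 300 are those given above.

```python
from typing import Optional


def proximity_horizon(first_hit_pos: int,
                      num_query_terms: int,
                      proportional_proximity: Optional[int] = None,
                      proximity_threshold: int = 300) -> int:
    """Character position beyond which term hits fall outside the window."""
    if proportional_proximity is not None:
        # Rule (a): the window grows with the number of query terms.
        window = num_query_terms * proportional_proximity
    else:
        # Rule (b): a fixed window when the proportional parameter is unset.
        window = proximity_threshold
    return first_hit_pos + window
```

For a four-term query whose first hit is at position 5881, rule (a) with a factor of 100 gives a horizon of 6281, while rule (b) gives 6181.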

2. The best scoring query hit that can be made from
the current contents of the proximity buffer and
whose penalty is less than the penalty-threshold is
found by the method described below in Section
2G. If no such match can be made, skip to step 6.

3. If this query hit scores no better than the worst
hit in the output buffer and the output buffer is already
full, this hit is discarded and the method skips
to step 6 below. If this query hit overlaps another
query hit already in the output buffer, then that hit is
replaced with this hit if this hit has a better score, or
else this hit is discarded if its score is not better.
Otherwise, this query hit is inserted into the output
buffer at the appropriate rank according to its penalty
score, throwing away the worst hit in the buffer
if the buffer was already full. If the output buffer is
now full, the parameter penalty-threshold is set to
the worst query penalty in the output buffer.

4. If the output buffer is full and the last hit has zero
penalty, then the method stops generating hits and
returns the contents of the output buffer.

5. If there are no more entailing hits to generate,
then the method stops and returns the contents of
the output buffer.

6. Otherwise, the first term hit in the proximity buffer
is removed from the proximity buffer, and the method
proceeds to step 1.
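Steps 0 through 6 can be sketched as a single loop over one document. This is a simplified illustration under several assumptions: term hits are reduced to bare character positions, the scoring of Section 2G is passed in as a hypothetical callable best_hit(buffer, threshold) that returns a (penalty, start, end) tuple or None, the output limit is at least 1, and an emptied proximity buffer is re-seeded from the next generated hit (the text does not spell out that case).

```python
from typing import Callable, List, Optional, Sequence, Tuple

Hit = Tuple[float, int, int]  # (penalty, start, end) of a query hit's passage


def insert_hit(output: List[Hit], hit: Hit, limit: int) -> List[Hit]:
    """Step 3: rank `hit` in the output buffer, resolving overlaps."""
    overlapping = [old for old in output
                   if hit[1] < old[2] and old[1] < hit[2]]
    if any(old[0] <= hit[0] for old in overlapping):
        return output  # an overlapping hit is at least as good: discard
    kept = [old for old in output if old not in overlapping]
    # sorted() is stable, so earlier hits win ties on penalty; slicing
    # drops the worst hit if the buffer would overflow.
    return sorted(kept + [hit], key=lambda h: h[0])[:limit]


def find_query_hits(positions: Sequence[int], window_size: int,
                    best_hit: Callable[[List[int], float], Optional[Hit]],
                    limit: int, max_penalty: float = 50.0) -> List[Hit]:
    output: List[Hit] = []
    threshold = max_penalty                      # step 0
    buffer: List[int] = list(positions[:1])      # step 0: seed with first hit
    nxt = len(buffer)                            # next term hit to generate
    while buffer:
        horizon = buffer[0] + window_size        # step 1
        while nxt < len(positions) and positions[nxt] <= horizon:
            buffer.append(positions[nxt])
            nxt += 1
        hit = best_hit(buffer, threshold)        # step 2
        if hit is not None:
            output = insert_hit(output, hit, limit)   # step 3
            if len(output) == limit:
                threshold = output[-1][0]
                if threshold == 0:               # step 4: all hits perfect
                    break
            if nxt >= len(positions):            # step 5: nothing left
                break
        buffer.pop(0)                            # step 6
        if not buffer and nxt < len(positions):  # assumption: re-seed
            buffer.append(positions[nxt])
            nxt += 1
    return output
```

Tightening the threshold once the output buffer is full lets later windows be rejected in step 2 without ever being ranked, which is where most of the saving comes from on large documents.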

The foregoing summary of the method of identifying
and sorting query hits is clarified by the flow chart of
Figure 5A. In general, the method 600 involves the steps
of moving a window on the document, the window having
a fixed length depending upon the query size, and
anchoring the window at some point on the document
(beginning with the first entailing term hit). For each
window position, the method searches for a passage
containing matches for the query terms. The best such
matches are put in the output buffer until a predetermined
maximum number of perfect matches has been located,
or until the search has exhausted all documents.

At box 610 of Figure 5A, the method begins identification
of target regions containing matches for the query
terms.

At box 620, the proximity buffer is seeded with the
first entailing term hit for the current document, and at
box 630 the penalty threshold is set to a predefined
maximum. An "entailing term hit" may be defined as follows:
for each term in the query, there is some set of terms in the
term/concept relationship network that could entail that
query term. A match for a given query term may include
either that query term precisely or some other term that
entails that query term. Either type of match is thus
referred to herein as an entailing term hit, and the set of
all such entailing term hits relative to all such query
terms may be referred to as the "entire entailing set".

At box 640, the proximity horizon is set as discussed
above, i.e. the window is positioned at the next entailing
term hit for the current target passage. (At the first pass
through this box, the "next" entailing term hit is the first
entailing term hit.) At box 650, the proximity buffer is
then filled with all qualified entailing term hits as defined
in step 1 above.

At box 660, the method determines whether there 
is any query hit that can be made from the term hits in 
the proximity buffer with a penalty better than (i.e. lower 
than) the current penalty threshold. On the first pass 
through, this will be a comparison with the predefined 
maximum penalty threshold. If there is no such query hit 
that can be made from the term hits within the proximity 
buffer, then the first hit in the proximity buffer is removed 
at box 740, and the proximity horizon is reset at box 640 
with the beginning of the window at the (new) first term 
in the proximity buffer. 

At box 650, the proximity buffer is again filled with 
qualified entailing term hits (defined in step 1 above), 
which in this example results in effectively moving the 
proximity window down one entailing term hit relative to 
the previous iteration of step 650. At box 660, it is again 
determined whether there is any query hit that can be 
made from the (new) contents of the proximity buffer 
with a penalty lower than the current penalty threshold,
and the process continues. 

If a query hit is found that meets this test, then the 
method proceeds to box 670, where the best query hit 
(i.e. the query hit with the lowest penalty) in the proximity 
buffer is designated as the "current" query hit. The best- 
scoring query hit in the proximity buffer is determined as 
described generally in Sections 2A-2C above, and a de- 
tailed procedure for doing so according to a preferred 
embodiment is set forth in Section 2G below. 

At box 680, it is determined whether the current
query hit's penalty is better (lower) than that of the worst hit in
the output buffer (where the best query hits are stored
in preparation for output to display or to a file upon
completion of the search procedure). If not, then the current
query hit is discarded at box 730, the first term hit is
removed from the proximity buffer at box 740, and the
method proceeds back to box 640 as before, to reposition
the window for another try at a better query hit.

If at box 680 the current query hit was better than 
the worst hit in the output buffer, then at box 690 any 
lower-scored overlaps are suppressed, meaning that 
any query hit whose target passage overlaps with the 
target passage of the current query hit is compared with 
the current query hit, and the query hit with the lower 
score (higher penalty) is discarded. If these two query 
hits have the same penalty score, then the first query hit 
is retained. 

At box 700, if the output buffer is full, then at box
710 the processor discards the lowest-scoring entry in
the output buffer. The method then proceeds to step
720, where the current query hit is inserted into the
output buffer. This is done by an insertion sort, i.e. the
penalty of the current query hit is compared with that of the
first hit in the output buffer, and if it is lower it is inserted above
the latter and all the other hits are moved down. If not,
then the current hit's penalty is compared with that of
the next hit in the output buffer, and so on, until a hit is found
whose penalty exceeds the current hit's penalty; the current hit is
inserted at that point and the other hits are moved down.
This ensures that the output buffer is always sorted upon
insertion of the current hit.
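The ranked insertion can be sketched with a binary search in place of the linear comparison described above; the resulting order is the same. The function name insert_ranked and the (penalty, passage) tuple layout are assumptions for illustration, and trimming after the insert replaces the separate discard at box 710.

```python
import bisect
from typing import List, Tuple


def insert_ranked(output: List[Tuple[float, str]],
                  hit: Tuple[float, str], limit: int) -> None:
    """Insert hit so that output stays sorted by penalty, best first."""
    penalties = [penalty for penalty, _ in output]
    # bisect_right places the new hit after existing hits of equal
    # penalty, so earlier hits keep their rank on ties.
    at = bisect.bisect_right(penalties, hit[0])
    output.insert(at, hit)
    del output[limit:]  # drop the worst entries if the buffer overflowed
```

With a limit of 2, inserting hits with penalties 0.155, 0.115 and 2.849 in that order leaves the buffer holding the 0.115 hit followed by the 0.155 hit.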

Other variations are possible, such as inserting by
comparing with the lowest-scoring hit in the output buffer
and moving up (coming from the opposite end, in effect),
or doing a sort after the search is completed. Other sorts
(such as tree sorts) would also be suitable; however, an
insertion sort is one convenient method for comparing
new current hit penalties with those already stored, and
for filling the output buffer and sorting it simultaneously.

At box 750, the method determines whether the output
buffer is now full, given the addition of the latest
current query hit. If it is, then the penalty threshold is set to
that of the worst query hit in the output buffer (box 760),
and in either case the method proceeds to box 770. Here
it is determined whether the last query hit in the output
buffer had zero penalty; if so, this indicates that the output
buffer is full with zero-penalty hits, and there is no
point in searching further, so the method proceeds to
box 790, where the contents of the output buffer are
returned, and the method proceeds back to step 540 for
displaying, storing, etc. the hits, as before. Note that the
size of the output buffer may be selected by the user or
set by an executing process, so in general it is variable
in size.

If at box 770 the last query hit in the output buffer
does not have a zero penalty, then at box 780 the method
determines whether there are any more entailing
term hits to generate, i.e. whether all entailing term hits
from the index have been exhausted. If there are no
more hits to be generated, then the method proceeds to
box 790. Otherwise, it proceeds to box 740, where the
first entailing term hit is removed from the proximity buffer,
so as to reposition the proximity window to the next
entailing term hit. The method then proceeds again to
box 640.

Upon completion of the method 600 of Figure 5A, the
output buffer is filled with query hits in ranked order
from best (lowest penalty) to worst.

2G. Method for Determining Best-Scoring Query Hit

Following is a suitable method for determining
which of the entailing term hits in the current proximity
buffer can be used in conjunction with one another to
form a query hit having the best score, i.e. the lowest
aggregate or combined penalty. Thus, this method
provides a procedure for actually scoring the term hits
located within a window on a document.

A. Let q1, q2, ..., qm be the successive query terms of
the query q and let x1, x2, ..., xn be the sequence of
entailing term hits in the current proximity buffer (i.e., within
the proximity horizon of the first entailing term hit in the
proximity buffer). Search all possible alignments a = (q1,
xi1), (q2, xi2), ..., (qm, xim) of terms in the query with
entailing hits from the proximity buffer such that the first
term x1 in the proximity buffer is aligned with one of the
query terms and each query term is paired with either
one of the xij's in the proximity buffer that entails it or
with a marker that indicates that it is missing. These
alignments are searched in order to find the best ranking
such hit, i.e., the hit with the lowest penalty score as
assigned by the following ranking algorithm:

B. For each pair (qj, xij) sum the following penalties:

1. morphological variation penalty -- if qj and xij
have the same morphological root, but are not the
same inflected or derived form (i.e., are not either
both root forms, or both singular nouns, or both third
person singular verbs, etc.), then penalize each of
the two that is not a root form by an amount determined
by the parameter *inflection-penalty* or
*derivation-penalty* depending on whether the morphological
relationship involved is one of inflection or of
derivation. (In the preferred embodiment, these
penalties are 0.08 and 0.1, respectively. This component
of the ranking penalty can obviously be
modified to use different penalties or to incorporate
different penalties for different kinds of inflectional or
derivational relationship.)

2. taxonomic specialization penalty -- if (the root of)
qj is a more general term than (the root of) xij according
to the subsumption taxonomy, then penalize
the alignment by an amount determined by the
parameter *descendants-penalty*. (In the preferred
embodiment, this parameter is 0.1. This component
of the ranking penalty can obviously be modified to
use a different penalty or to incorporate a dimension
of semantic distance between the more general
term and the more specific term.)

3. semantic entailment penalty -- if (the root of) qj
is semantically entailed by (the root of) xij according
to the known entailment relationships, then penalize
the alignment by an amount determined by the
parameter *entailment-penalty*. (In the preferred
embodiment, this parameter is 0.1. This component
of the ranking penalty can obviously be modified to
use a different penalty or to incorporate a dimension
of entailment strength between the query term and
the entailing term.)

4. missing term penalty - if (the root of) qj cannot
be aligned with any of the xij terms in the proximity
buffer by one of the above relationships (same
morphological root, taxonomic specialization relationship
between roots, or semantic entailment relationship between
roots) and is therefore marked as missing, then penalize
that term with a penalty determined as follows:

if the term is in one of the following syntactic
word classes:
(adverb auxiliary conjunction initial interjection modal
nameprefix operator possessive preposition pronoun
punctuation title)
then penalize it by *missing-qualifier-penalty*

if the term is or can be a verb
then penalize it by *missing-verb-penalty*

if the term is one of the syntactic word classes
(adjective, determiner)
then penalize it by *missing-adjective-penalty*

otherwise penalize it by *missing-term-penalty*

(In the preferred embodiment, the missing-qualifier-penalty
is 2; the missing-verb-penalty is 5; the
missing-adjective-penalty is 75; and the missing-term-penalty
is 10. This component of the ranking penalty can be
modified to use different penalties or different categories
of penalties or to incorporate a dimension of term
frequency or term importance or syntactic role to determine
the penalty for a missing term.)
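
The per-term penalties 2 through 4 above can be sketched as follows. This is only an illustrative sketch: the function names, the string labels for the relationships, and the representation of a term's word classes as a set of strings are assumptions made for the example, and the constants are simply the preferred-embodiment values as printed above.

```python
# Preferred-embodiment values as printed in the text; all are tunable parameters.
DESCENDANTS_PENALTY = 0.1
ENTAILMENT_PENALTY = 0.1
MISSING_QUALIFIER_PENALTY = 2
MISSING_VERB_PENALTY = 5
MISSING_ADJECTIVE_PENALTY = 75
MISSING_TERM_PENALTY = 10

QUALIFIER_CLASSES = {
    "adverb", "auxiliary", "conjunction", "initial", "interjection",
    "modal", "nameprefix", "operator", "possessive", "preposition",
    "pronoun", "punctuation", "title",
}


def missing_term_penalty(word_classes):
    """Penalty for a query term marked as missing.

    word_classes is a set of syntactic class names the term can have
    (a hypothetical representation chosen for this sketch). The tests
    are applied in the order given in the text.
    """
    if word_classes & QUALIFIER_CLASSES:
        return MISSING_QUALIFIER_PENALTY
    if "verb" in word_classes:  # "is or can be a verb"
        return MISSING_VERB_PENALTY
    if word_classes & {"adjective", "determiner"}:
        return MISSING_ADJECTIVE_PENALTY
    return MISSING_TERM_PENALTY


def term_penalty(relation, word_classes=frozenset()):
    """Penalty for one query term given how it was aligned.

    relation is "taxonomic", "entailment", or "missing" (labels assumed
    for this sketch; same-root alignments are covered by penalty 1).
    """
    if relation == "taxonomic":
        return DESCENDANTS_PENALTY
    if relation == "entailment":
        return ENTAILMENT_PENALTY
    if relation == "missing":
        return missing_term_penalty(set(word_classes))
    raise ValueError(f"unknown relation: {relation!r}")
```

For example, a missing term that can be either a noun or a verb would receive the missing-verb penalty under this sketch, because the verb test is applied before the default case.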

C. To the above accumulated penalties, add the following
penalties that are determined for the alignment as a whole:

5. proximity ranking penalty - For each successive
pair of entailing terms in the alignment in order of
their occurrence in the text, penalize any gap between
them that is larger than a single character by an amount
equal to the parameter *gap-penalty-factor* times one less
than the number of characters between them. (In the
preferred embodiment, this parameter is 0.005. This
component of the ranking penalty can obviously be modified
to use a different penalty factor or to use a word count or
other proximity measure other than a character count
to measure the gap between words.)

6. permutation penalty - For each successive pair
of query terms, if the corresponding entailing terms
in the alignment are not in the same order in the
text, then penalize this hit by an amount equal to
the parameter *out-of-order-penalty*. (In the preferred
embodiment, this parameter is 0.25. This
component of the ranking penalty can obviously be
modified to use a different penalty factor or to use
various other measures of the degree to which the
order of the terms in the hit is different from the order
of terms in the query.)

7. internal boundary penalty - Scan the portion of
the text covered by the region from the earliest entailing
hit of the alignment to the latest entailing hit
of the alignment and for each sentence boundary
or paragraph boundary contained in that portion of
the text, add a penalty equal to the parameter
*cross-sentence-penalty* or *cross-paragraph-penalty*
depending on whether the boundary is an
end of sentence or a paragraph boundary. (In the
preferred embodiment, these parameters are 0.1
and 50, respectively. This component of the ranking
penalty can obviously be modified to use different
penalties.)
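
The three whole-alignment penalties 5 through 7 can be illustrated with the following sketch. The function names and data layout are assumptions made for the example: each aligned term is given as a (start, end) pair of character offsets with an exclusive end, and the sentence and paragraph boundaries are assumed to have been located beforehand and supplied as lists of character offsets.

```python
# Preferred-embodiment values from the text; all are tunable parameters.
GAP_PENALTY_FACTOR = 0.005
OUT_OF_ORDER_PENALTY = 0.25
CROSS_SENTENCE_PENALTY = 0.1
CROSS_PARAGRAPH_PENALTY = 50


def proximity_penalty(spans):
    """Penalty 5: spans are (start, end) offsets of the aligned terms."""
    ordered = sorted(spans)  # order of occurrence in the text
    total = 0.0
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        gap = next_start - prev_end  # characters between the two terms
        if gap > 1:
            total += GAP_PENALTY_FACTOR * (gap - 1)
    return total


def permutation_penalty(starts_in_query_order):
    """Penalty 6: start offsets of aligned terms listed in query order.

    A missing term is given as None; pairs involving a missing term are
    skipped here, which is an assumption of this sketch.
    """
    total = 0.0
    pairs = zip(starts_in_query_order, starts_in_query_order[1:])
    for first, second in pairs:
        if first is not None and second is not None and first > second:
            total += OUT_OF_ORDER_PENALTY
    return total


def boundary_penalty(spans, sentence_ends, paragraph_breaks):
    """Penalty 7: count boundaries between the earliest and latest hit."""
    if not spans:
        return 0.0
    lo = min(start for start, _ in spans)
    hi = max(end for _, end in spans)
    sentences = sum(1 for pos in sentence_ends if lo <= pos < hi)
    paragraphs = sum(1 for pos in paragraph_breaks if lo <= pos < hi)
    return (sentences * CROSS_SENTENCE_PENALTY
            + paragraphs * CROSS_PARAGRAPH_PENALTY)
```

With these values, two aligned terms separated by ten characters contribute 0.005 x 9 = 0.045, whereas a single paragraph boundary inside the region contributes 50, so under the preferred-embodiment values a hit spanning a paragraph break ranks far below one contained in a single paragraph.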

If at any point it can be determined that the penalty 
score of a partially generated alignment is already worse
than the score of some other alignment that can be gen- 
erated or is worse than the specified penalty threshold, 
then the inferior partial alignment can be discarded at 
that point and not considered further. There are many 
conventional techniques for performing such searches 
to be found in the literature on computer science search 
algorithms. 

D. Choose the alignment with the best (smallest) 
total penalty if one can be found that is better than the 
penalty threshold. This completes the penalty scoring of 
the terms, and hence the location of the best-scoring 
query hit from the current proximity buffer. 
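
One conventional way to combine the pruning rule with step D is a depth-first search that abandons a partial alignment once its accumulated penalty can no longer beat the best total found so far or the penalty threshold. The sketch below is an assumption-laden illustration rather than the method of the preferred embodiment: the function name and argument layout are hypothetical, and it relies on every penalty being non-negative, so that a partial sum is a lower bound on the final score.

```python
def best_alignment(candidates, missing_penalties, whole_penalty, threshold):
    """Return (total_penalty, alignment) or None if nothing beats threshold.

    candidates[i]        list of (hit, penalty) options for query term i
    missing_penalties[i] penalty if query term i is left unaligned
    whole_penalty        function(alignment) -> non-negative penalty for
                         the alignment as a whole (gap, order, boundaries)
    threshold            only alignments strictly better than this count
    """
    best = {"penalty": threshold, "alignment": None}

    def extend(i, partial, chosen):
        # Prune: penalties only grow, so this branch cannot improve.
        if partial >= best["penalty"]:
            return
        if i == len(candidates):
            total = partial + whole_penalty(chosen)
            if total < best["penalty"]:
                best["penalty"] = total
                best["alignment"] = list(chosen)
            return
        options = list(candidates[i]) + [(None, missing_penalties[i])]
        for hit, penalty in options:
            chosen.append(hit)
            extend(i + 1, partial + penalty, chosen)
            chosen.pop()

    extend(0, 0.0, [])
    if best["alignment"] is None:
        return None
    return best["penalty"], best["alignment"]
```

In this sketch an unaligned term is represented by None in the returned alignment, and the smallest total penalty wins, matching the convention that the best score is the smallest one.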

2H. Method for Generating Entailing Term Hits

This method utilizes the term/concept relationship 
network 110, which can either be constructed manually 
off-line or automatically constructed during the indexing 
process by the method described in Section 1, and further
described in Section 2I below, using a knowledge base
of manually constructed relationships and morphologi- 
cal rules. In this network, any given term that occurs in 
the corpus of indexed material or may occur in a query 
term is represented and may be associated with one or 
more concepts that the term in question may denote. 
These words and concepts in turn can be related to each 
other by the following morphological, taxonomic, and 
semantic entailment relationships: 

1. term x is a root form of an inflected or derived 
term y 

2. term or concept x taxonomically subsumes term 
or concept y (i.e., term or concept x is a more gen- 
eral term or concept than term or concept y). 

3. term or concept x may be entailed by term or
concept y.

In general, these relationships must be looked up
in knowledge bases of such relationships (120, 150 and
180), which are constructed off-line by data entry. Some
morphological relationships, however, can be derived
automatically by morphological rules applied to inflected
and derived forms of words encountered in the text.
Such morphological rules are generally part of the
conventional systems in computational linguistics.

The entailing terms for a query q = q1, q2, ..., qm (the
"entire entailing set") will be the set of all terms that occur
in the corpus that entail any of the terms qi in q, where
a term x entails a term qi if any of the following hold:

1. x or a root of x is equal to qi or a root of qi
2. x or a root of x taxonomically subsumes qi or a
root of qi or a concept denoted by x or a root of x
taxonomically subsumes qi or a root of qi or a
concept denoted by qi or a root of qi
3. x or a root of x is semantically entailed by qi or a
root of qi or a concept denoted by x or a root of x is
semantically entailed by qi or a root of qi or a
concept denoted by qi or a root of qi.

The entailing term hits for a query q = q1, q2, ..., qm will
be the sequence of all term occurrences in the corpus
that entail any of the terms qi in q or any concepts that
are denoted by terms qi in q. These entailing term hits
are generated in order of their occurrence in the corpus
by creating a collection of generators for each entailing
term, each of which will generate the occurrences of that
term in order of their occurrence in the corpus (determined
first by a default ordering of all of the documents
of the corpus and secondarily by the position of the term
occurrence within a document). At any step of the
generation, the next generated entailing term hit is
generated by choosing the entailing term generator with the
earliest hit available for generation and generating that
term hit. At the next step of generation, a different
entailing term generator may have the earliest hit available
to generate. This entailing term hit generator can be
called repeatedly in order to find all of the entailing term
hits that occur within a window of the corpus starting at
some term occurrence in some file and continuing until
some proximity horizon beyond that root term occurrence
has been reached.
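
Choosing, at each step, the generator with the earliest available hit amounts to a k-way merge of sorted streams. A minimal sketch using Python's standard `heapq.merge` is given below; the index layout (a dictionary from each term to a sorted list of (document order, position) pairs) and the function names are assumptions made for the example.

```python
import heapq


def term_occurrences(index, term):
    """Yield the occurrences of one term in corpus order.

    index is a hypothetical mapping: term -> list of
    (document_order, position) pairs, already sorted.
    """
    for document_order, position in index.get(term, []):
        yield (document_order, position, term)


def entailing_term_hits(index, entailing_terms):
    """Lazily merge the per-term generators into one ordered stream.

    heapq.merge repeatedly emits the smallest head among its inputs,
    i.e. the generator with the earliest hit is chosen at each step.
    """
    generators = [term_occurrences(index, term) for term in entailing_terms]
    return heapq.merge(*generators)
```

Because the merge is lazy, a caller can draw hits one at a time and stop as soon as the proximity horizon has been passed, without enumerating the remaining occurrences in the corpus.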

2I. Generating the Term/Concept Relationship Network

During indexing as described in Section 1 above (or
in a separate pass) as each word or phrase in the indexed
material is encountered, it is looked up in a growing
term/concept relationship network 110 of words and
concepts and relationships among them that is being
constructed as the corpus is analyzed. If the word or
phrase is not already present in this term/concept
relationship network 110, it is added to it.

The first time each such word or phrase is encountered,
it is also looked up in manually constructed external
knowledge bases of word and concept relationships
(120, 150 and 180), and if it is found in these external
networks, then all words and concepts in the external
networks that are known to be entailed by this word or
phrase or that are derived or inflected forms of this word
or phrase are added to the growing term/concept
relationship network 110 together with the known
relationships among them. If such a word or phrase is not found
in the external network, then it may be analyzed by
morphological rules to determine if it is an inflected or
derived form of a word that is known in the external
knowledge bases (120, 150 and 180), and if so, its
morphological relationship to its root is recorded in the
term/concept relationship network and its root form is treated
as if it had occurred in the corpus (i.e., that root is looked
up in the external networks and all of its entailments,
inflections, derivations, and relationships are added).

At the end of this process, a term/concept relation- 
ship network will have been constructed that contains 
all of the terms that occur in the corpus plus all of the 
concepts entailed by or morphologically related to them, 
together with all of the known morphological, taxonomic,
and entailing relationships among them. This network is 
then used in processing queries to find entailing term 
hits for query terms. 
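
The incremental construction described in this section might be sketched as follows. Everything named here is hypothetical: the external knowledge bases are collapsed into a single dictionary from a term to its (relation, related term) pairs, the morphological rules are represented by a caller-supplied function `find_root`, and the relation labels are illustrative only.

```python
def build_network(corpus_terms, knowledge_base, find_root):
    """Build a term/concept relationship network from the corpus terms.

    corpus_terms    iterable of words or phrases in order of encounter
    knowledge_base  dict: term -> list of (relation, related_term)
                    (stands in for the external knowledge bases)
    find_root       function(term) -> root form or None
                    (stands in for the morphological rules)
    Returns a dict: term -> set of (relation, related_term).
    """
    network = {}
    looked_up = set()  # terms already checked against the knowledge base

    def add(term):
        network.setdefault(term, set())
        if term in looked_up:
            return  # the external lookup happens only the first time
        looked_up.add(term)
        if term in knowledge_base:
            for relation, other in knowledge_base[term]:
                network[term].add((relation, other))
                network.setdefault(other, set())
        else:
            root = find_root(term)
            if root is not None and root in knowledge_base:
                network[term].add(("has-root", root))
                add(root)  # treat the root as if it occurred in the corpus

    for term in corpus_terms:
        add(term)
    return network
```

In this sketch a term that is neither in the knowledge base nor reducible to a known root is still added to the network, but with no relationships attached.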

2J. Query Size Procedural Adaptation

An embodiment of the system of the invention has 
in trial runs proven to be particularly effective for han- 
dling short queries of two or three words, or perhaps up 
to about six, in contrast to traditional retrieval methods, 
which are generally poor at handling short queries. 
Thus, a further enhancement of the invention may be 
had by using conventional word search techniques 
when one or more than some number N words are to be 
searched. The number N may be preset or may be se- 
lected by the user or a process in response to the suc- 
cess of the searching results, and may be 3-6 or more, 
depending upon the generated results. Such a system 
uses the best of both conventional techniques and the 
present invention, whose operation would thus be con- 
fined to the particularly difficult region of queries with just 
a few words. 

2J. Document Retrieval Application 

This passage retrieval technique can be applied to
conventional document retrieval problems, to retrieve
and rank documents by giving each document the score
of the best passage it contains.
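
As a brief illustration, assuming that passage scores are penalties (so that a smaller score is better) and that each scored passage is available as a (document identifier, penalty) pair, documents could be ranked as in the sketch below. The function name and input layout are assumptions made for the example.

```python
def rank_documents(passage_hits):
    """Rank documents by the penalty of their best-scoring passage.

    passage_hits is an iterable of (doc_id, penalty) pairs, one per
    scored passage. Returns (doc_id, best_penalty) pairs, best first.
    """
    best = {}
    for doc_id, penalty in passage_hits:
        if doc_id not in best or penalty < best[doc_id]:
            best[doc_id] = penalty
    return sorted(best.items(), key=lambda item: item[1])
```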

While the present invention has been described
with reference to a few specific embodiments, the
description is illustrative of the invention and is not to be
construed as limiting the invention. Various modifica- 
tions may occur to those skilled in the art within the 
scope of the invention. For example, although in the de- 
scribed embodiment the invention has been implement- 
ed using computer software, it will be appreciated that 
functions implemented by the software could be imple- 
mented in firmware or by means of special purpose 
hardware (e.g. ASICs) in alternative embodiments. 



Claims 

1. A method for locating information in documents in
a database stored in a memory coupled to a processor,
the method being carried out by program
steps executed by said processor, including the
steps of:

(1) receiving a search query including at least
one query term;

(2) generating at least one hit passage from
said documents, said hit passage including at
least one hit term corresponding to said at least
one query term;

(3) for at least a first hit term and a second hit
term corresponding, respectively, to at least a
first query term and a second query term,
determining a first distance between said first and
second hit terms and a second distance
between said first and second query terms;

(4) generating a factor having a magnitude
based upon a comparison of said first distance
with said second distance; and

(5) generating a score for said hit passage
incorporating the magnitude of said factor.

2. The method of claim 1, wherein:

step 2 includes the step of generating a plurality
of said hit passages;

step 3 is carried out for at least two said hit
terms in each of said plurality of hit passages;

step 4 is carried out for each of said distances
determined in step 3 for each set of corresponding
hit terms and query terms; and

step 5 is carried out for said plurality of hit
passages.

3. The method of claim 2, further including the steps
of:

after step 5, determining a best-scored said hit
passage; and

retrieving at least said best-scored hit passage.

4. The method of claim 2, further including the steps 
of: 

after step 5, determining a best-scored said hit 
passage; and 

retrieving at least a document containing said 
best-scored hit passage. 

5. The method of claim 1, wherein said score is
generated at least in part based upon a factor
proportional to said first distance.

6. The method of claim 1, wherein said hit passage
has a size based upon a size of said search query.

7. The method of claim 1, wherein said score is
additionally based upon a penalty generated from a
measure of a semantic similarity between at least
one query term and at least one hit term.

8. The method of claim 1, wherein said score is
additionally based upon a penalty generated from a
comparison of the total number of query terms with
the total number of hit terms.

9. The method of claim 3, including the step of provid- 
ing at least one hyperlink in said retrieved passage, 
said hyperlink linked to the document containing 
said passage. 

10. A method for locating information in documents in 
a database stored in a memory coupled to a proc- 
essor, the method being carried out by program 
steps executed by said processor, including the 
steps of: 

(1) receiving a search query including at least 
a first query term and a second query term in a 
first order; 

(2) generating at least one hit passage from 
said documents, said hit passage including at 
least a first hit term corresponding to said first 
query term and a second hit term correspond- 
ing to said second query term, said first and 
second hit terms being in a second order; 

(3) generating a factor having a magnitude 
based upon a comparison of said first order with 
said second order; and 

(4) generating a score for said hit passage in- 
corporating the magnitude of said factor. 

11. The method of claim 10, further including the steps
of:

after step 4, determining a best-scored said hit
passage; and

retrieving at least said best-scored hit passage.

12. The method of claim 10, further including the steps
of:

after step 4, determining a best-scored said hit
passage; and

retrieving at least a document containing said
best-scored hit passage.

13. The method of claim 10, wherein said score is
additionally based upon a penalty generated from a
measure of a semantic similarity between at least
one query term and at least one hit term.

14. The method of claim 10, wherein said score is
additionally based upon a penalty generated from a
comparison of the total number of query terms with
the total number of hit terms.

15. A method for locating information in documents in
a database stored in a memory coupled to a processor
of a computer system, the computer system
further including a proximity buffer and an output
buffer coupled to said processor, the method being
carried out by program steps executed by said
processor and including the steps of:

(1) receiving a search query including at least
one query term;

(2) determining at least one target region of at
least one said document in said database;

(3) setting a penalty threshold to a predefined
maximum;

(4) determining query hits corresponding to
said query terms within said target region and
correlating with each said query hit a score
reflecting how closely it corresponds to its
corresponding query term;

(5) storing said query hits in said proximity
buffer;

(6) designating a best-scoring query hit from
said proximity buffer as a current query hit;

(7) if said output buffer is full, discarding a
lowest-scored query hit;

(8) inserting said current query hit into said
output buffer;

(9) if the output buffer is now full, setting said
penalty threshold to the score of a lowest-scored
query hit in the output buffer;

(10) if a predetermined criterion is met, then
proceeding to step 13 and otherwise proceeding
to step 11;

(11) if there are more entailing term hits to
generate, then proceeding to step 12 and otherwise
proceeding to step 13;

(12) repositioning the target region relative to
said document, and proceeding to step 4; and

(13) returning the contents of the output buffer.

16. The method of claim 15, wherein the predetermined
criterion of step 10 is whether the last lowest-scored
query hit in the output buffer has zero penalty.

17. The method of claim 15, wherein the predetermined
criterion of step 10 is whether all documents have
been searched.

18. A computer system for locating information in doc- 
uments in a database stored in a memory coupled 
to a processor of said computer system, including: 

a query module configured to receive a search 
query including a plurality of query terms; 
a retrieval module configured to retrieve pas- 
sages from said documents, each said pas- 
sage including at least one hit term correspond- 
ing to at least one said query term; 
a scoring module configured to generate 
scores for said passages based upon an order 
of occurrence of said query terms compared 
with an order of occurrence of hit terms appear- 
ing in said passages and corresponding to said 
query terms. 

19. A search system for retrieving and ranking passag- 
es of documents in a database, including: 

a retrieval module configured to retrieve pas- 
sages from said documents in response to a 
search query including at least one query term, 
each said passage including at least one hit 
term corresponding to at least one said query 
term; and 

a scoring module configured to generate 
scores for said passages based upon an order 
of occurrence of said query terms compared 
with an order of occurrence of hit terms appear- 
ing in said passages and corresponding to said 
query terms. 

20. A computer system for locating information in
documents in a database stored in a memory coupled
to a processor of said computer system, including:

a query module configured to receive a search
query including a plurality of query terms;
a retrieval module configured to retrieve at least
one passage from said documents, said passage
including at least two said hit terms
corresponding to at least two said query terms;
a scoring module configured to generate
scores for said passages based upon a factor
having a magnitude incorporating a distance
between said at least two said hit terms.

21. A search system for retrieving and ranking passages
of documents in a database, including:

a retrieval module configured to retrieve at least
a first said passage from said documents in
response to a search query including a plurality
of query terms, said passage including at least
two said hit terms corresponding to at least two
said query terms; and

a scoring module configured to generate
scores for said passages based upon a factor
having a magnitude incorporating a distance
between said at least two said hit terms.